Dealing with word-internal modification and spelling variation in data-driven lemmatization
Abstract
This paper describes our contribution to two challenges in data-driven lemmatization. We approach lemmatization in the framework of a two-stage process, where first lemma candidates are generated and afterwards a ranker chooses the most probable lemma from these candidates. The first challenge is that languages with rich morphology like Modern German can feature morphological changes of different kinds, in particular word-internal modification. This makes the generation of the correct lemma a harder task than just removing suffixes (stemming). The second challenge that we address is spelling variation as it appears in non-standard texts. We experiment with different generators that are specifically tailored to deal with these two challenges. We show in an oracle setting that there is a possible increase in lemmatization accuracy of 14% with our methods to generate lemma candidates on Middle Low German, a group of historical dialects of German (1200–1650 AD). Using a log-linear model to choose the correct lemma from the set, we obtain an actual increase of 5.56%.
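The two-stage setup described in the abstract (generate lemma candidates, then rank them with a log-linear model) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy training pairs, the suffix-rewrite generator, the two features, and the hand-set weights. In particular, plain suffix rewriting is exactly the kind of generator that struggles with word-internal modification and spelling variation, which is why the paper experiments with more specialized generators.

```python
import math
from collections import Counter


def suffix_rule(form, lemma):
    """Strip the longest common prefix and return (form_suffix, lemma_suffix)."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]


def learn_rules(pairs):
    """Count how often each suffix-rewrite rule occurs in (form, lemma) pairs."""
    return Counter(suffix_rule(form, lemma) for form, lemma in pairs)


def generate(form, rules):
    """Stage 1: apply every matching rule to propose lemma candidates."""
    candidates = {}
    for (form_suffix, lemma_suffix), count in rules.items():
        if form.endswith(form_suffix):
            stem = form[:len(form) - len(form_suffix)]
            candidate = stem + lemma_suffix
            candidates[candidate] = max(candidates.get(candidate, 0), count)
    return candidates


def rank(candidates, known_lemmas, weights):
    """Stage 2: log-linear scoring, normalized over the candidate set."""
    scores = {}
    for candidate, count in candidates.items():
        features = {
            "log_rule_count": math.log(count),
            "in_lexicon": 1.0 if candidate in known_lemmas else 0.0,
        }
        scores[candidate] = sum(weights[name] * value
                                for name, value in features.items())
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}


def oracle_hit(candidates, gold_lemma):
    """Oracle setting: is the correct lemma among the generated candidates?"""
    return gold_lemma in candidates


# Hypothetical toy data, not from the paper's Middle Low German corpus.
pairs = [("tagen", "tag"), ("wegen", "weg"), ("kinder", "kind"), ("haus", "haus")]
rules = learn_rules(pairs)
candidates = generate("bergen", rules)

# Hand-set weights; a real log-linear ranker would learn these from data.
weights = {"log_rule_count": 1.0, "in_lexicon": 2.0}
probs = rank(candidates, known_lemmas={"berg"}, weights=weights)
best = max(probs, key=probs.get)
```

The oracle check separates the two sources of error: if the gold lemma is not among the candidates, no ranker can recover it, so the oracle figure bounds what the ranking stage can achieve.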
Similar resources
LGeRM: lemmatization of Middle French words
Unlike most modern languages, Middle French is a language whose spelling is not yet stabilized. There is a great deal of variation in the spelling of a word and accordingly the traditional methods for lemmatization cannot be used. LGeRM (lemmes, graphies et règles morphologiques) proposes a solution based on a databank containing known lemmatized spellings and a set of graphical and morphologic...
Weigh your words - memory-based lemmatization for Middle Dutch
This article deals with the lemmatization of Middle Dutch literature. This text collection—like any other medieval corpus—is characterized by an enormous spelling variation, which makes it difficult to perform a computational analysis of this kind of data. Lemmatization is therefore an essential preprocessing step in many applications, since it allows the abstraction from superficial textual va...
Syllabification of Middle Dutch
The study of spelling variation can be seen as a window allowing us to understand the phonological systems of the dialects of Middle Dutch, and to what extent they differed. Syllabic information is of great help in the study of spelling variation, but manual annotation of large corpora is a labor-intensive task. We present a method for automatic syllabification of words in Middle Dutch texts. W...
Design and implementation of Persian spelling detection and correction system based on Semantic
The Persian language has special features (graphemes, homophones, and multi-shape clinging characters) on electronic devices. Furthermore, designing and implementing NLP tools for Persian is more challenging than for other languages (e.g., English or German). Spelling tools are widely used for editing user texts such as emails and text in editors. Also, developing Persian tools will provide Persian progr...
Simple Data-Driven Context-Sensitive Lemmatization
Lemmatization for languages with rich inflectional morphology is one of the basic, indispensable steps in a language processing pipeline. In this paper we present a simple data-driven context-sensitive approach to lemmatizing word forms in running text. We treat lemmatization as a classification task for Machine Learning, and automatically induce class labels. We achieve this by computing a S...